I'm participating in this open data science course and I'm now supposed to start my own project, work on it in RStudio, communicate it using R Markdown and share it via GitHub. Let's see how frightened I will be of this "new technology".
Check out my GitHub!!
The following has nothing to do with the actual course work. I’m just testing out stuff.
I first made the mistake of placing the data file that I wanted to read in into a separate folder inside the IODS-project folder. When I then tried to read in the file, my code couldn't find it. After some googling I found out that this happens because the R console and R Markdown use separate, independent working directories. You can read about this issue here.
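A possible fix (a sketch, not the only way): point knitr's working directory at the project root in the setup chunk, so that the console and the knitted document agree, or simply use a path relative to the project structure.
#set the working directory used by all knitr chunks to the project root
knitr::opts_knit$set(root.dir = "C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project")
#...after which something like the following (the subfolder name here is hypothetical) would work in both places
#money <- read.csv("data/Sirkesjam.csv", header = TRUE)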
So I start off by checking that I'm in the right working directory:
getwd()
## [1] "C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project"
Everything’s in order. So now let’s read in the data that consists of my money expenditure during the past year.
money <- read.csv("Sirkesjam.csv", header = TRUE)
#let's see the first few rows
head(money)
## date amount
## 1 2018-10-26 10.40
## 2 2018-10-25 8.84
## 3 2018-10-25 7.78
## 4 2018-10-24 0.20
## 5 2018-10-23 60.00
## 6 2018-10-23 3.50
Let’s plot the data to see whether there are any nice (or worrying) patterns to see.
require(ggplot2)
ggplot(data=money,aes(date, amount))+geom_line()
For some reason the plot looks different in R compared to the knitted outcome shown above. I will get back to this later (a guess at the cause is sketched a bit further down).
Here's how the plot looks in R:
My idea is to study whether I use more money on weekends or during holiday seasons. I could imagine that this is the case, BUT normally during weekends and holidays I go out hiking somewhere in the forests, where there are fewer opportunities to spend money, so I might be wrong.
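Coming back to the difference between the knitted plot and the plot shown in R: my guess (an assumption, not something I have verified) is that the date column gets read in as a factor or character rather than as a proper Date, so the x axis is handled differently in the two environments. A minimal sketch of the fix:
# assumption: after converting 'date' to the Date class, ggplot should treat
# the x axis as a continuous time axis both in the console and when knitting
money$date <- as.Date(money$date, format = "%Y-%m-%d")
ggplot(data = money, aes(date, amount)) + geom_line()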
This week I've learned about data wrangling and linear regression analysis. The basis of all data science is efficient data wrangling: one has to be able to modify, filter, convert etc. the raw data into an appropriate format to be able to perform any statistical analysis with it.
I started my "statistical journey" with linear regression. This analysis finds out whether there are statistically significant relationships between the dependent variable and the explanatory variables. Depending on the number of explanatory variables, the method is called either simple linear regression or multiple linear regression. I fitted a multiple linear regression model, interpreted the results and checked the validity of my model.
Here's how I did it:
Let’s start by reading in the data and exploring the structure and dimensions of the data:
#the data can be read in with the read.csv command; header = T (TRUE) tells R that the file has a header row
learning2014<-read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/learning2014.csv",header = T)
#the str command lists all the variables in the data, their class and the first few observations
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
#the dim command gives the number of rows and columns in the data
dim(learning2014)
## [1] 166 7
The data comes from a questionnaire survey that measured students' learning when studying statistics. The data consists of 166 observations (individuals) and 7 measured variables. Here's a list explaining the variables:
age = Age (in years) derived from the date of birth
attitude = Global attitude toward statistics
points = Exam points
gender = Gender: M (Male), F (Female)
deep = Deep approach to learning (on a scale from 1 to 5)
surf = Surface approach to learning (on a scale from 1 to 5)
stra = Strategic approach to learning (on a scale from 1 to 5)
You can read more about the study variables here.
Let’s explore the variation in the variables:
#let's take a summary of all the variables
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
We can see that most of the students that participated in the study are female, and the mean age is around 25 years. The three learning approach variables (deep, stra and surf) and the exam points all have fairly symmetric distributions.
Then let's visualize how the different variables relate to the points that the students score:
# access the GGally and ggplot2 libraries
library(GGally)
## Warning: package 'GGally' was built under R version 3.4.4
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggplot2)
# create a plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
-Looks like students with a higher (better) attitude score higher points. There may be a slight gender difference, but the main conclusion is the same for both genders.
-A deep learning approach doesn't seem to influence points scored.
-A strategic learning approach seems to have a slight positive effect on points scored.
-A surface learning approach seems to have a slight negative effect on points scored.
-Older students seem to score fewer points.
In summary, it looks like attitude and the strategic and surface learning approaches might affect points most strongly. But to be sure, we want to test this statistically.
Let’s fit a linear model where points are explained by attitude, stra and surf:
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the model summary we can see that the variables stra and surf do not have a statistically significant effect on points, because their p-values (Pr(>|t|)) are greater than 0.05. In other words, we cannot conclude that the students' exam points depend on their strategic or surface learning approach.
Attitude, however, has a statistically significant positive effect on points: its p-value is far below 0.05.
Let's explore which variables we can drop as unnecessary. Probably surf, but can we also drop stra?
In the following I use AIC values for model selection. You can read more about them here.
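Before running the selection, here is a quick sanity check of where the AIC numbers below come from. This is just a sketch: step() uses extractAIC() under the hood, which for lm models computes n*log(RSS/n) + 2*p (p = number of estimated coefficients) and therefore differs from AIC() by an additive constant while ranking models the same way.
#sanity check of the AIC value reported by step() for the full model
n <- nrow(learning2014)
rss <- sum(residuals(my_model2)^2)
p <- length(coef(my_model2))
n * log(rss / n) + 2 * p   # about 557.4, matching the step() output below
extractAIC(my_model2)      # edf and the same AIC value, as computed by R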
# the step command drops each variable one at a time and computes an AIC value; the model with the lowest AIC is the preferred one
step(my_model2)
## Start: AIC=557.4
## points ~ attitude + stra + surf
##
## Df Sum of Sq RSS AIC
## - surf 1 15.00 4559.4 555.95
## <none> 4544.4 557.40
## - stra 1 69.61 4614.0 557.93
## - attitude 1 980.95 5525.3 587.85
##
## Step: AIC=555.95
## points ~ attitude + stra
##
## Df Sum of Sq RSS AIC
## <none> 4559.4 555.95
## - stra 1 81.74 4641.1 556.90
## - attitude 1 1051.89 5611.3 588.41
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Coefficients:
## (Intercept) attitude stra
## 8.9729 3.4658 0.9137
#the drop1 command does pretty much the same thing: it performs single-term deletions, i.e. it drops each variable one at a time from the full model and reports the resulting AIC
drop1(my_model2)
## Single term deletions
##
## Model:
## points ~ attitude + stra + surf
## Df Sum of Sq RSS AIC
## <none> 4544.4 557.40
## attitude 1 980.95 5525.3 587.85
## stra 1 69.61 4614.0 557.93
## surf 1 15.00 4559.4 555.95
These analyses suggest that we should drop 'surf' from the model as a non-influential variable but keep 'stra' still in the game. However, because the AIC difference to the next best model is less than 2, it is usually recommended to choose the simpler model.
Therefore, let’s fit a new model with only attitude:
#new model with only attitude as the explanatory variable
my_model3 <- lm(points ~ attitude, data = learning2014)
#summary of the new model
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
From the summary of the final model we first see the distribution of the residuals, i.e. the differences between the observed values and the values fitted by the model.
Next there is a coefficient table. A regression coefficient represents the mean change in the response variable (points) for a one-unit change in the predictor variable (attitude), holding the other predictors in the model constant.
The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect).
From these results I deduce that attitude has a strong positive effect.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is the percentage of the response variable variation that is explained by a linear model.
In our case the R-squared value is around 20 % which seems low but in some fields, it is entirely expected that R-squared values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than 50%. Humans are simply harder to predict than physical processes.
The F statistic on the last line tells whether the regression as a whole performs 'better than random': any set of random predictors will show some relationship with the response, so the test asks whether my model fits better than would be expected if the predictors had no real relationship with the response. This p-value is well below 0.05, so my model makes sense.
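As a quick check (a sketch, relying on the textbook relation between R-squared and the F statistic for a model with k predictors), the F value above can be reproduced from the R-squared:
#F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for a model with k predictors
r2 <- summary(my_model3)$r.squared
n <- nrow(learning2014)
k <- 1
(r2 / k) / ((1 - r2) / (n - k - 1))   # about 38.6, as in the summary above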
Then it’s time for model validation. I made certain assumptions about the data when I decided to use a linear regression model in my analysis. Now I have to test that these assumptions hold and are not violated.
Let’s draw some diagnostic plots. These things are easier to interpret visually.
par(mfrow = c(2,2))
plot(my_model2,which=c(1,2,5))
The first plot is a scatter plot of residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers. If the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly.
In this case everything seems to be ok. A few observations (labelled with their row numbers) have clearly lower residuals, but they are not obvious outliers in my opinion.
The second plot is basically a Q-Q plot, which is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, the points should form a roughly straight line.
In this case it is a NORMAL Q-Q plot of the residuals, which checks the assumption that the model errors are normally distributed. This assumption seems to hold.
The last plot helps me to find influential cases, if there are any. Not all outliers are influential in linear regression analysis. When cases fall outside Cook's distance (the dashed red line, not even visible in this plot), they are influential for the regression results: excluding them would alter the fitted model.
In this case there are no influential cases; the red dashed line beyond which influential cases would lie is not even visible in the plot.
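The same conclusion can also be checked numerically; a minimal sketch (using the common rule of thumb that Cook's distances well below 1 are not a concern; other cut-offs exist):
#compute Cook's distances for the fitted model and look at the largest one
cd <- cooks.distance(my_model2)
max(cd)   # values well below 1 suggest no single observation dominates the fit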
My model seems to be a valid one. It explains around 20 % of the variation in the points that students scored in the exam. The most important factor positively affecting the points was attitude.
So check your attitude!!
This week I learned about logistic regression.
Let’s start by reading in the data and exploring the structure and dimensions of the data:
#read in the csv
alc<-read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/alc.csv",header = T)
#print the names of the columns
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data consists of student achievement in secondary education at two Portuguese schools. The attributes include student grades (G1, G2 and G3) and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets were provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). These datasets have been combined so that numerical variables are averaged and categorical values are taken directly from the Mathematics dataset. Alcohol consumption during weekdays and weekends has been combined into an average and categorised as high use (yes or no). Check the variable details here.
The data consists of 35 variables and 382 observations.
Next I want to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I choose 4 interesting variables and, for each of them, present a hypothesis about its relationship with alcohol consumption.
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). I assume students with very bad family relationships have high alcohol consumption.
absences - number of school absences (numeric: from 0 to 93). I assume students with lots of absences have high alcohol consumption.
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours). I assume that students with low weekly study time have high alcohol consumption.
higher - wants to take higher education (binary: yes or no). I assume students who don't want to take higher education have high alcohol consumption.
#get some packages
library(tidyr); library(dplyr); library(ggplot2)
#select only the data that I am interested in
choose<-c("high_use","famrel","absences","studytime","higher")
alc2<-select(alc,one_of(choose))
#summary table of the data
summary(alc2)
## high_use famrel absences studytime higher
## Mode :logical Min. :1.000 Min. : 0.0 Min. :1.000 no : 18
## FALSE:268 1st Qu.:4.000 1st Qu.: 1.0 1st Qu.:1.000 yes:364
## TRUE :114 Median :4.000 Median : 3.0 Median :2.000
## Mean :3.937 Mean : 4.5 Mean :2.037
## 3rd Qu.:5.000 3rd Qu.: 6.0 3rd Qu.:2.000
## Max. :5.000 Max. :45.0 Max. :4.000
#draw barplots of variables to study their distribution
gather(alc2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
None of the variables seem to be evenly distributed.
library(GGally)
# create a plot matrix with ggpairs()
p <- ggpairs(alc2, mapping = aes(col=high_use,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
-There are more students that don’t have a high use of alcohol than those that do.
-There are more students that want to take higher education than those who don’t want to.
-There are many students with very few absences, some with a handful, and then again quite a few students with more absences.
-Family relations seem to be fairly good for most of the students.
-Most of the students study 2-5 hours weekly. Only a few students study a lot.
Looks like all the hypotheses I made hold some truth, although I might not have enough data in all observed categories to verify my assumptions.
#fit a logistic regression model
m <- glm(high_use ~ famrel + absences+studytime+higher-1, data = alc2, family = "binomial")
#plot a summary table
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + absences + studytime + higher -
## 1, family = "binomial", data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3321 -0.8215 -0.6473 1.1762 2.1735
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## famrel -0.23369 0.12639 -1.849 0.064466 .
## absences 0.07753 0.02273 3.411 0.000647 ***
## studytime -0.52003 0.15754 -3.301 0.000963 ***
## higherno 1.13706 0.73724 1.542 0.122996
## higheryes 0.67440 0.60457 1.115 0.264638
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 529.56 on 382 degrees of freedom
## Residual deviance: 429.40 on 377 degrees of freedom
## AIC: 439.4
##
## Number of Fisher Scoring iterations: 4
From the summary table I can see that the relationship:
-between high use and famrel is not significant, but students with better (higher) family relations are less likely to be in the high use category.
-between absences and high use is highly significant: students with many absences are more likely to be in the high use category.
-between study time and high use is also highly significant: students who spend more time studying are less likely to be in the high use category.
-between higher education and high use is not significant, but students with no willingness to take higher education are more likely to be in the high use category.
I study the categorical variable 'higher' a bit more. To check whether the variable 'higher' improves the model fit, I fit one model with it (my.mod1) and one without it (my.mod2) and conduct a likelihood ratio test. This tests the hypothesis that all coefficients of higher are zero:
my.mod1 <- glm(high_use ~ famrel + absences+studytime+higher-1, data = alc2, family = "binomial")
my.mod2 <- glm(high_use ~ famrel + absences+studytime-1, data = alc2, family = "binomial")
anova(my.mod1, my.mod2, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ famrel + absences + studytime + higher - 1
## Model 2: high_use ~ famrel + absences + studytime - 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 377 429.40
## 2 379 431.77 -2 -2.3695 0.3058
This test tells me that having the variable ‘higher’ in the model doesn’t improve it so I can drop it. I fit a new model without it:
#the new model again
m2 <- glm(high_use ~ famrel + absences+studytime, data = alc2, family = "binomial")
#summary table of the new model
summary(m2)
##
## Call:
## glm(formula = high_use ~ famrel + absences + studytime, family = "binomial",
## data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1549 -0.8286 -0.6531 1.1925 2.1845
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.75052 0.59833 1.254 0.209713
## famrel -0.23533 0.12630 -1.863 0.062431 .
## absences 0.07773 0.02252 3.452 0.000557 ***
## studytime -0.54411 0.15551 -3.499 0.000467 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 430.19 on 378 degrees of freedom
## AIC: 438.19
##
## Number of Fisher Scoring iterations: 4
Next I'll try dropping the variable 'famrel' because it doesn't seem to be significant (p-value greater than 0.05). I fit a model without 'famrel'.
m3 <- glm(high_use ~ absences + studytime, family = "binomial", data = alc2)
summary(m3)
##
## Call:
## glm(formula = high_use ~ absences + studytime, family = "binomial",
## data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2128 -0.8387 -0.7046 1.1996 2.1832
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16640 0.33954 -0.490 0.624087
## absences 0.08054 0.02285 3.524 0.000425 ***
## studytime -0.55015 0.15550 -3.538 0.000403 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 433.65 on 379 degrees of freedom
## AIC: 439.65
##
## Number of Fisher Scoring iterations: 4
The model without 'famrel' has a higher AIC value than the model including it. When comparing AIC values one should choose the model with the lowest AIC, but if the difference between the best models is less than 2 units it is advisable to choose the simpler one. Based on this advice I drop 'famrel', and my final model includes only absences and studytime as explanatory variables.
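The same comparison can be made side by side; a small sketch (AIC() on glm objects reports the same values as the model summaries above):
#compare the AIC values of the two candidate models in one call
AIC(m2, m3)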
Next I compute the coefficients of the final model as odds ratios and provide confidence intervals for them:
#compute odds ratios
OR <- coef(m3) %>% exp
# compute confidence intervals (CI)
CI<-confint(m3)%>%exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.8467110 0.4349166 1.6508083
## absences 1.0838772 1.0386751 1.1361111
## studytime 0.5768622 0.4211716 0.7758799
An odds ratio higher than 1 means that the variable is positively associated with "success", in this case being in the high use category. Here, 'absences' is positively associated with high use and 'studytime' is negatively associated.
From the confidence intervals I can check that 1 is not included in them, because an interval containing 1 would mean the variable has no clear effect. I can also look at how wide the intervals are: a wide confidence interval indicates low precision of the odds ratio, whereas a narrow interval indicates higher precision.
In my case studytime has a rather wide confidence interval, so its odds ratio is estimated with lower precision.
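A small worked example of what the absences odds ratio means in practice (assuming, as the model does, that the effect is linear on the log-odds scale):
#ten additional absences multiply the odds of being a high user by roughly
OR["absences"]^10   # about 2.2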
To validate my model I use it to make predictions and compare them with the true observed values.
# predict() the probability of high_use
probabilities <- predict(m3, type = "response")
# add the predicted probabilities to 'alc'
alc2 <- mutate(alc2, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability>0.5)
## Warning: package 'bindrcpp' was built under R version 3.4.4
# see the last ten original classes, predicted probabilities, and class predictions
select(alc2, studytime, absences, high_use, probability, prediction) %>% tail(10)
## studytime absences high_use probability prediction
## 373 1 0 FALSE 0.3281537 FALSE
## 374 1 7 TRUE 0.4618903 FALSE
## 375 3 1 FALSE 0.1497827 FALSE
## 376 1 6 FALSE 0.4419431 FALSE
## 377 3 2 FALSE 0.1603317 FALSE
## 378 2 2 FALSE 0.2486902 FALSE
## 379 2 2 FALSE 0.2486902 FALSE
## 380 2 3 FALSE 0.2640419 FALSE
## 381 1 4 TRUE 0.4026660 FALSE
## 382 1 2 TRUE 0.3645990 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 256 12
## TRUE 96 18
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)%>%prop.table()%>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67015707 0.03141361 0.70157068
## TRUE 0.25130890 0.04712042 0.29842932
## Sum 0.92146597 0.07853403 1.00000000
Above I produced two confusion tables (in absolute numbers and in percentage) from which I can see how many or what proportion of the predictions were correct.
Then the same thing as a plot:
g <- ggplot(alc2, aes(x = probability, y = high_use,col=prediction))
# define the geom as points and draw the plot
g+geom_point()
In this plot I can see both the actual values and the predictions. I have quite a lot of FALSE predictions for cases that were actually TRUE, but that is not as bad as predicting TRUE for cases that were actually FALSE.
To quantify the goodness of my model I can compute the proportion of incorrect predictions.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions
lfm3<-loss_func(class = alc2$high_use, prob = alc2$probability)
lfm3
## [1] 0.2827225
The result is 28 %. I could have computed that also from the second cross table (25 + 3 = 28 %). In my opinion that sounds like a lot, but I guess it is still better than a simple guessing strategy. 72 % correct predictions sounds better :)
The problem with this kind of model validation is that the model is asked to predict the same values that are used to create the model. This seems like a circular argument.
Another way of doing model validation is cross-validation. Here a proportion of the data is set aside as testing data and the model is fitted with training data (the rest) only. The resulting model is then fed with testing data to make predictions. The goodness of the model is then quantified as the proportion of correct predictions. This procedure is repeated several times so that the whole dataset has been used as testing data. If the cross-validation is 5-fold it means that the data are split into 5 proportions of which 1/5 is used as testing dataset one at a time. Which data to select as testing data can be random or for example spatially defined.
In R cross-validation is easy to run several times, so I repeat the following R code a couple of times:
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2984293
All the results I get are close to 29-30 %. This means that with cross-validation my model seems to perform worse than with the loss function computed on the training data earlier (and than the model used in DataCamp). I assume this is because cross-validation is a more thorough procedure and therefore gives a more realistic estimate of prediction accuracy. Without proper model validation the results might be overly optimistic.
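Rather than rerunning the chunk by hand, the 10-fold cross-validation could be repeated in one go and averaged. A minimal sketch (the number of repeats and the seed are arbitrary choices of mine):
#repeat the 10-fold cross-validation 20 times and average the error
set.seed(2018)
cv_errors <- replicate(20, cv.glm(data = alc2, cost = loss_func, glmfit = m, K = 10)$delta[1])
mean(cv_errors)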
Next I do a little exploration. I compare the performance of different logistic regression models (= different sets of predictors). I build models using 30, 19, 10 and 5 variables, compute training and testing errors for all of them and compare the results.
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
#select 30 variables
chooseM1<-c("age","high_use","famrel","absences","studytime","higher","activities","nursery", "internet","romantic","G3","address","Pstatus","Medu","Fedu","Mjob","Fjob", "reason","guardian","traveltime","failures","schoolsup", "famsup","paid", "freetime", "goout","health","school","sex","famsize")
alcM1<-select(alc,one_of(chooseM1))
#29 explaining variables
M1<-glm(high_use ~ school+higher+sex+age+ address+ famsize+ Medu+ Fedu+ Mjob+ Fjob+ reason+ nursery+ internet+ guardian+ traveltime+ failures+ schoolsup+ famsup+ paid+ freetime+ goout+ health+absences+studytime+famrel+activities+romantic+G3+Pstatus, data = alcM1, family = "binomial")
summary(M1)
##
## Call:
## glm(formula = high_use ~ school + higher + sex + age + address +
## famsize + Medu + Fedu + Mjob + Fjob + reason + nursery +
## internet + guardian + traveltime + failures + schoolsup +
## famsup + paid + freetime + goout + health + absences + studytime +
## famrel + activities + romantic + G3 + Pstatus, family = "binomial",
## data = alcM1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8741 -0.6724 -0.3610 0.4884 2.8798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.15777 3.12129 -1.652 0.098443 .
## schoolMS -0.63603 0.54400 -1.169 0.242336
## higheryes -0.09098 0.69925 -0.130 0.896481
## sexM 0.78325 0.32823 2.386 0.017019 *
## age 0.11618 0.14915 0.779 0.435985
## addressU -0.79383 0.41437 -1.916 0.055398 .
## famsizeLE3 0.50910 0.32421 1.570 0.116356
## Medu 0.04330 0.22061 0.196 0.844378
## Fedu 0.17994 0.19563 0.920 0.357689
## Mjobhealth -1.07131 0.76315 -1.404 0.160382
## Mjobother -0.88901 0.49991 -1.778 0.075350 .
## Mjobservices -0.87464 0.57337 -1.525 0.127151
## Mjobteacher -0.28326 0.69357 -0.408 0.682977
## Fjobhealth 0.39818 1.19257 0.334 0.738465
## Fjobother 0.84816 0.86476 0.981 0.326685
## Fjobservices 1.42043 0.88485 1.605 0.108434
## Fjobteacher -0.44976 1.03350 -0.435 0.663434
## reasonhome 0.46349 0.38041 1.218 0.223067
## reasonother 1.49313 0.54221 2.754 0.005891 **
## reasonreputation -0.07508 0.41153 -0.182 0.855246
## nurseryyes -0.53587 0.36730 -1.459 0.144582
## internetyes 0.05745 0.45599 0.126 0.899736
## guardianmother -0.80193 0.36635 -2.189 0.028601 *
## guardianother -0.20393 0.79701 -0.256 0.798051
## traveltime 0.32510 0.23173 1.403 0.160644
## failures 0.22823 0.26686 0.855 0.392408
## schoolsupyes 0.07851 0.46427 0.169 0.865721
## famsupyes -0.08117 0.32407 -0.250 0.802224
## paidyes 0.59908 0.31690 1.890 0.058699 .
## freetime 0.30046 0.16728 1.796 0.072471 .
## goout 0.86758 0.15351 5.651 1.59e-08 ***
## health 0.18614 0.11321 1.644 0.100130
## absences 0.08983 0.02664 3.372 0.000747 ***
## studytime -0.31530 0.20585 -1.532 0.125599
## famrel -0.56897 0.16717 -3.404 0.000665 ***
## activitiesyes -0.57526 0.30855 -1.864 0.062265 .
## romanticyes -0.54641 0.33105 -1.651 0.098840 .
## G3 0.03375 0.05190 0.650 0.515575
## PstatusT -0.36418 0.53795 -0.677 0.498421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 322.06 on 343 degrees of freedom
## AIC: 400.06
##
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M1, type = "response")
# add the predicted probabilities to 'alc'
alcM1 <- mutate(alcM1, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcM1 <- mutate(alcM1, prediction = probability>0.5)
# call loss_func to compute the average number of wrong predictions
lfM1<-loss_func(class = alcM1$high_use, prob = alcM1$probability)
#cross-validation
cv <- cv.glm(data = alcM1, cost = loss_func, glmfit = M1, K = 10)
# average number of wrong predictions in the cross validation
cvM1<-cv$delta[1]
#####################################################################
#19 variables
chooseM2<-c("high_use","famrel","absences","studytime","activities","nursery", "romantic","address", "reason","guardian","traveltime","failures","paid", "freetime", "goout","health","school","sex","famsize")
alcM2<-select(alc,one_of(chooseM2))
#18 explaining variables
M2<-glm(high_use ~ school+sex+ address+ famsize+ reason+ nursery+ guardian+ traveltime+ failures+ paid+ freetime+ goout+health+absences+studytime+famrel+activities+romantic, data = alcM2, family = "binomial")
summary(M2)
##
## Call:
## glm(formula = high_use ~ school + sex + address + famsize + reason +
## nursery + guardian + traveltime + failures + paid + freetime +
## goout + health + absences + studytime + famrel + activities +
## romantic, family = "binomial", data = alcM2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9380 -0.6912 -0.3921 0.5668 2.7954
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.24229 1.12611 -1.991 0.046461 *
## schoolMS -0.36846 0.50370 -0.732 0.464470
## sexM 0.83559 0.30472 2.742 0.006104 **
## addressU -0.86275 0.38198 -2.259 0.023905 *
## famsizeLE3 0.48723 0.30501 1.597 0.110173
## reasonhome 0.25059 0.36062 0.695 0.487133
## reasonother 1.11296 0.49712 2.239 0.025168 *
## reasonreputation -0.28950 0.37982 -0.762 0.445936
## nurseryyes -0.41960 0.33948 -1.236 0.216463
## guardianmother -0.72823 0.33009 -2.206 0.027373 *
## guardianother 0.10564 0.73524 0.144 0.885750
## traveltime 0.24550 0.21934 1.119 0.263024
## failures 0.17320 0.23209 0.746 0.455499
## paidyes 0.69857 0.29234 2.390 0.016870 *
## freetime 0.17239 0.15346 1.123 0.261274
## goout 0.84707 0.14158 5.983 2.19e-09 ***
## health 0.13123 0.10449 1.256 0.209140
## absences 0.09286 0.02425 3.830 0.000128 ***
## studytime -0.25968 0.19222 -1.351 0.176709
## famrel -0.48025 0.16126 -2.978 0.002900 **
## activitiesyes -0.43744 0.28545 -1.532 0.125416
## romanticyes -0.50198 0.30813 -1.629 0.103287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 338.38 on 360 degrees of freedom
## AIC: 382.38
##
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M2, type = "response")
# add the predicted probabilities to 'alc'
alcM2 <- mutate(alcM2, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcM2 <- mutate(alcM2, prediction = probability>0.5)
# call loss_func to compute the average number of wrong predictions
lfM2<-loss_func(class = alcM2$high_use, prob = alcM2$probability)
cv <- cv.glm(data = alcM2, cost = loss_func, glmfit = M2, K = 10)
# average number of wrong predictions in the cross validation
cvM2<-cv$delta[1]
##########################################################################3
#10 variables
chooseM3<-c("high_use","famrel","absences","address", "reason","guardian","paid", "goout","sex","famsize")
alcM3<-select(alc,one_of(chooseM3))
#9 explaining variables
M3<-glm(high_use ~ sex+ address+ famsize+ reason+ guardian+ paid+ goout+absences+famrel, data = alcM3, family = "binomial")
summary(M3)
##
## Call:
## glm(formula = high_use ~ sex + address + famsize + reason + guardian +
## paid + goout + absences + famrel, family = "binomial", data = alcM3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9617 -0.7446 -0.4527 0.6513 2.8447
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.13120 0.78716 -2.707 0.006780 **
## sexM 0.99325 0.27671 3.589 0.000331 ***
## addressU -0.91297 0.33198 -2.750 0.005959 **
## famsizeLE3 0.44078 0.29197 1.510 0.131128
## reasonhome 0.15096 0.34314 0.440 0.659987
## reasonother 0.89873 0.46825 1.919 0.054940 .
## reasonreputation -0.53510 0.35986 -1.487 0.137017
## guardianmother -0.70149 0.31737 -2.210 0.027085 *
## guardianother 0.33524 0.69442 0.483 0.629265
## paidyes 0.52463 0.27638 1.898 0.057669 .
## goout 0.86127 0.13257 6.497 8.21e-11 ***
## absences 0.09035 0.02330 3.877 0.000106 ***
## famrel -0.43437 0.14964 -2.903 0.003699 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 354.60 on 369 degrees of freedom
## AIC: 380.6
##
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M3, type = "response")
# add the predicted probabilities to 'alc'
alcM3 <- mutate(alcM3, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcM3 <- mutate(alcM3, prediction = probability>0.5)
# call loss_func to compute the average number of wrong predictions
lfM3<-loss_func(class = alcM3$high_use, prob = alcM3$probability)
cv <- cv.glm(data = alcM3, cost = loss_func, glmfit = M3, K = 10)
# average number of wrong predictions in the cross validation
cvM3<-cv$delta[1]
##########################################################################3
#5 variables
chooseM4<-c("high_use","famrel","absences", "goout","sex")
alcM4<-select(alc,one_of(chooseM4))
#4 explaining variables
M4<-glm(high_use ~ sex+ goout+absences+famrel, data = alcM4, family = "binomial")
summary(M4)
##
## Call:
## glm(formula = high_use ~ sex + goout + absences + famrel, family = "binomial",
## data = alcM4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7151 -0.7820 -0.5137 0.7537 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76826 0.66170 -4.184 2.87e-05 ***
## sexM 1.01234 0.25895 3.909 9.25e-05 ***
## goout 0.76761 0.12316 6.232 4.59e-10 ***
## absences 0.08168 0.02200 3.713 0.000205 ***
## famrel -0.39378 0.14035 -2.806 0.005020 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 379.81 on 377 degrees of freedom
## AIC: 389.81
##
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use
probabilities <- predict(M4, type = "response")
# add the predicted probabilities to 'alc'
alcM4 <- mutate(alcM4, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcM4 <- mutate(alcM4, prediction = probability>0.5)
# call loss_func to compute the average number of wrong predictions
lfM4<-loss_func(class = alcM4$high_use, prob = alcM4$probability)
cv <- cv.glm(data = alcM4, cost = loss_func, glmfit = M4, K = 10)
# average number of wrong predictions in the cross validation
cvM4<-cv$delta[1]
#create a data frame containing the number of variables and training and testing errors
Nvariables<-c(5,10,19,30)
errors<-as.data.frame(Nvariables)
errors$train=NA
errors[1, 2] = lfM4
errors[2, 2] = lfM3
errors[3, 2] = lfM2
errors[4, 2] = lfM1
errors$test=NA
errors[1, 3] = cvM4
errors[2, 3] = cvM3
errors[3, 3] = cvM2
errors[4, 3] = cvM1
#plot the results
dfplot <- errors %>% gather(key, value, -Nvariables)
ggplot(dfplot, mapping = aes(x = Nvariables, y = value, color = key) ) + geom_line()
It seems that reducing the number of explanatory variables from 30 to 5 has little effect on the training error, whereas it has a clear effect on the testing error: the testing error decreases, so models with fewer variables give more accurate predictions on new data.
I'm not sure how much I can read into this result, because as I reduced the number of explanatory variables I dropped them based on their statistical insignificance.
This week I learned about clustering and classification: how to cluster observations, how to study which factors drive or justify the clustering, how many clusters are appropriate, and so on.
This week’s data comes from an R package called MASS:
# access the MASS package
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.4
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The dataset consists of 14 variables and 506 observations. All variables are numerical; one variable ('chas') is a 1/0 presence/absence dummy variable. The variables describe housing values in the suburbs of Boston together with factors measured in the suburbs that are thought to be related to housing values. These include, for example, the crime rate, access to the Charles River, nitrogen oxides concentration, the average number of rooms per dwelling, distances to five Boston employment centres, accessibility to radial highways, the proportion of blacks by town and the median value of owner-occupied homes. The full details can be found here.
Let’s explore graphically the distributions and relations of the data:
pairs(Boston)
This plot is difficult to read; I'll figure out later how to improve the quality of the output. From the summary table I am, however, able to explore the variation and distributions of the variables.
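One quick way to make the scatter-plot matrix more readable would be to plot only a handful of columns and use smaller plotting symbols; a sketch (the column selection is my own choice):
#plot a subset of the variables with tiny points
pairs(Boston[, c("crim", "nox", "rm", "dis", "tax", "lstat", "medv")], pch = ".", cex = 1.5)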
Here are two dotplots of variables ‘crim’ (per capita crime rate by town) and ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft.) which show that these two variables are not very evenly distributed:
dotchart(Boston$crim)
dotchart(Boston$zn)
Some variables seem correlated. Here’s a correlation matrix of the variables:
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
library(magrittr)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>%round(digits=2)
# print the correlation matrix
print(cor_matrix)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
The above matrix is not very readable as it extends into two separate parts. Let’s present the correlations in a nicer way.
# visualize the correlation matrix
corrplot(cor_matrix, method="circle",type="upper",cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
This plot is easier to read. The bigger the circle, the stronger the correlation between the variables. Red indicates negative correlation and blue indicates positive correlation.
Some of the variables have very large values and wide ranges. We want to scale all the variables, because later on it may be difficult to combine or compare variables that are on different scales. Scaling can be applied to the whole dataset as all the variables are numerical.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
Now all the variables have a mean of zero and a standard deviation of one.
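As a quick sanity check (a sketch), scale() should have done nothing more than subtract each column's mean and divide by its standard deviation:
#compute the scaled crime rate by hand and compare with the scaled data
head((Boston$crim - mean(Boston$crim)) / sd(Boston$crim))
head(boston_scaled$crim)   # should be identical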
Next I create a categorical variable of the crime rate in the Boston dataset. I use quantiles as the break points. I drop the old crime rate variable from the dataset.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,label=c("low","med_low","med_high","high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
For later model evaluation purposes I divide the dataset into training and testing datasets, so that 80% of the data belongs to the train set:
##dividing the data into training and testing sets
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Next I want to know which variables might explain the target variable crime rate. I do a linear discriminant analysis with the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables:
# linear discriminant analysis
lda.fit <- lda(crime~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2425743 0.2475248 0.2623762
##
## Group means:
## zn indus chas nox rm
## low 0.92556146 -0.8877646 -0.07547406 -0.8719286 0.42503320
## med_low -0.03112371 -0.2968615 -0.03128211 -0.5979402 -0.09143903
## med_high -0.38261997 0.1594701 0.20012296 0.3402290 0.13461899
## high -0.48724019 1.0170298 -0.01233188 1.0643400 -0.40334703
## age dis rad tax ptratio
## low -0.8787887 0.8268723 -0.6821213 -0.7531703 -0.40903274
## med_low -0.3747715 0.4049195 -0.5435787 -0.4774726 -0.03413524
## med_high 0.4248520 -0.3752011 -0.4214193 -0.3285754 -0.25244681
## high 0.8192833 -0.8524791 1.6390172 1.5146914 0.78181164
## black lstat medv
## low 0.37611610 -0.74718069 0.5388863
## med_low 0.31102906 -0.18497535 0.0176242
## med_high 0.05802482 0.02292891 0.1469151
## high -0.69845498 0.86150093 -0.6615879
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11218217 0.70628844 -0.98047338
## indus 0.02894862 -0.13747087 0.45983881
## chas -0.09676340 -0.03686165 0.07751887
## nox 0.45524432 -0.76753789 -1.29674576
## rm -0.12479464 -0.15412825 -0.16712645
## age 0.18250909 -0.40913355 -0.06038953
## dis -0.08033113 -0.29342121 0.39170644
## rad 3.38226823 1.01825462 -0.12053609
## tax 0.02535344 -0.10580247 0.57598423
## ptratio 0.11788046 0.02321779 -0.26053519
## black -0.09628268 0.03445132 0.09627032
## lstat 0.23661386 -0.27344957 0.26445754
## medv 0.20889605 -0.40691365 -0.16933658
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9584 0.0320 0.0096
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 1)
Variable ‘rad’ looks like a strong classifying factor. Also ‘zn’ and ‘nox’ are dividing the observations.
Next I want to use the observations in the test set to predict crime classes. I do this because I want to estimate the “goodness” of my model by comparing predictions to observed “real” data.
For prediction I use the LDA model on the test data. For comparison I tabulate the results with the crime categories from the test set:
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 6 2 0
## med_low 2 16 10 0
## med_high 0 7 17 2
## high 0 0 0 21
I did the random division into train and test data and predicted the above classes twice. The first time I got a fairly poor result, with more than half of the med_high cases predicted incorrectly. On the second round the results look better (those results are shown here). Some classes are still predicted incorrectly, but at least most of the predictions are correct.
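Because the split above was not seeded, the knitted results will differ from run to run like this. A minimal sketch of making the split reproducible (the seed value is an arbitrary choice of mine):
#seed the random number generator before sampling the training rows
set.seed(123)
ind <- sample(n, size = n * 0.8)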
Next I study the Boston data without any classifications and try to cluster the observations into groups. Maybe the observations form clusters according to the suburbs. I run the k-means algorithm on the dataset, investigate what the optimal number of clusters is, and run the algorithm again.
First I reload the Boston dataset and standardize it. Then I calculate the Euclidean distances between the observations and present a summary of the distances:
#standardize the data set
boston_scaled2 <- scale(Boston)
# class of the boston_scaled object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2<-as.data.frame(boston_scaled2)
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Next I run the k-means clustering with 3 centers.
# k-means clustering
km <-kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2[9:14], col = km$cluster)
I zoomed in to various parts of the plot and found that, at least for the variable 'tax', the observations are divided so that the black cluster clearly forms its own group.
I also explored the clustering with 5 centers. The grouping seemed even more arbitrary.
Now, I'm not sure about the best number of clusters, so I compute the total within-cluster sum of squares (WCSS) and see how it behaves when the number of clusters changes:
library(ggplot2)
# set a seed so the k-means results are reproducible
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The total WCSS drops dramatically at around two clusters, so that is the optimal number of clusters for this dataset.
I run the clustering again with 2 centers:
# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2[1:6], col = km$cluster)
Now the clustering seems better, at least for some variable pairs. But in my opinion having only two groups doesn't tell much. Maybe it suggests that the residents of Boston are divided into two groups, the wealthy and the poor?
Next I perform the LDA again on the Boston dataset, this time with the clusters (3) as the target variable. By visualizing the results with a biplot I can interpret which variables influence the clustering.
boston_scaled3<-boston_scaled2
# k-means clustering
km <-kmeans(boston_scaled3, centers = 3)
klusteri<-km$cluster
class(klusteri)
## [1] "integer"
boston_scaled3<-cbind(boston_scaled3,klusteri)
summary(boston_scaled3)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv klusteri
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063 Min. :1.000
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989 1st Qu.:1.000
## Median : 0.3808 Median :-0.1811 Median :-0.1449 Median :2.000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean :1.972
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683 3rd Qu.:3.000
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865 Max. :3.000
# linear discriminant analysis
lda.fit2 <- lda(klusteri~., data = boston_scaled3)
# print the lda.fit object
lda.fit2
## Call:
## lda(klusteri ~ ., data = boston_scaled3)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3003953 0.4268775 0.2727273
##
## Group means:
## crim zn indus chas nox rm
## 1 0.8942488 -0.4872402 1.0913679 -0.01330932 1.1109351 -0.4609873
## 2 -0.3688324 -0.3935457 -0.1369208 0.07398993 -0.1662087 -0.1700456
## 3 -0.4076669 1.1526549 -0.9877755 -0.10115080 -0.9634859 0.7739125
## age dis rad tax ptratio black
## 1 0.7828949 -0.84882600 1.3656860 1.3895093 0.63256391 -0.7083974
## 2 0.1673019 -0.07766431 -0.5799077 -0.5409630 -0.04596655 0.2680397
## 3 -1.1241828 1.05650031 -0.5965522 -0.6837494 -0.62478941 0.3607235
## lstat medv
## 1 0.90799414 -0.69550394
## 2 -0.05818052 -0.04811607
## 3 -0.90904433 0.84137443
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.043702606 0.16161136
## zn 0.049248495 0.76920932
## indus -0.331498698 0.02870425
## chas -0.012406954 -0.11314905
## nox -0.721972554 0.40566595
## rm 0.174541989 0.41632858
## age 0.006221178 -0.88117192
## dis 0.043869924 0.36910493
## rad -1.256861546 0.47665247
## tax -0.992855786 0.46457291
## ptratio -0.092336951 -0.01003010
## black 0.073915653 -0.03513128
## lstat -0.372145848 0.38403679
## medv -0.058153798 0.49571753
##
## Proportion of trace:
## LD1 LD2
## 0.8785 0.1215
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(boston_scaled3$klusteri)
# plot the lda results
plot(lda.fit2, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit2, myscale = 1)
From these results I would interpret that the variable ‘rad’ (index of accessibility to radial highways) is the strongest linear separator in this dataset, although many other variables follow not far behind.
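To back that reading up numerically, one could rank the variables by the absolute size of their LD1 coefficients (a quick sketch using the lda.fit2 object above):
# variables ordered by the magnitude of their LD1 coefficient
sort(abs(lda.fit2$scaling[, "LD1"]), decreasing = TRUE)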
Next I’ll draw some 3D plots of the training data. NOTE! You might have to click around to see the figure.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## Warning: package 'plotly' was built under R version 3.4.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# Set the color to be the crime classes of the train set. Draw another 3D plot where the
# color is defined by the clusters of the k-means. How do the plots differ?
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color=train$crime)
I stop the exercise here because I’m having trouble understanding the instructions. I’m able to draw these two 3D plots, and crime seems to be a strong separator in the dataset. The last plot should demonstrate the division by clusters. However, I’m not sure anymore whether I should run the k-means clustering again on the training data and then change the plotting code, or whether I could do it just by modifying the color argument. I leave it here.
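Still, if I had to guess, the intended solution might be roughly along these lines: run k-means on the training predictors and use the resulting clusters as the color. This is an untested sketch, not necessarily the intended approach:
# k-means on the training predictors, clusters used as the color argument
km_train <- kmeans(model_predictors, centers = 3)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = factor(km_train$cluster))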
This week I learned about dimensionality reduction techniques.
First I read in data about human development. The data consists of several indicators that are thought to be important for a measure called ‘human development index’. This index is thought to measure the development of a country from a human perspective, so that a country’s development would not be measured only on economical terms.
human<-read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/human2.csv",header = T, row.names=1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ RAT.EDU: num 0.198 0.931 0.861 0.977 0.989 ...
## $ RAT.LAB: num 0.199 0.685 0.211 0.633 0.747 ...
## $ EYE : num 9.3 11.8 14 17.9 12.3 20.2 15.7 11.9 12.6 14.4 ...
## $ LEB : num 60.4 77.8 74.8 76.3 74.7 82.4 81.4 70.8 75.4 76.6 ...
## $ GNI : int 1885 9943 13054 22050 8124 42261 43869 16428 21336 38599 ...
## $ MMR : int 400 21 89 69 29 6 4 26 37 22 ...
## $ ABR : num 86.8 15.3 10 54.4 27.1 12.1 4.1 40 28.5 13.8 ...
## $ PARL : num 27.6 20.7 25.7 36.8 10.7 30.5 30.3 15.6 16.7 15 ...
The data consist of 155 observations of 8 variables. Each observation (row) represents a country.
Variables explained:
MMR = maternal mortality ratio
ABR = adolescent birth rate
PARL = percentage of female representatives in parliament
RAT.EDU = ratio of females to males with at least secondary education
RAT.LAB = ratio of females to males in the labour force
EYE = expected years of education
GNI = gross national income
LEB = life expectancy at birth
All the variables are numerical and mostly continuous.
summary(human)
## RAT.EDU RAT.LAB EYE LEB
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI MMR ABR PARL
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
hist(human$GNI)
Some of the variables (for example GNI) have very wide distributions and are not normally distributed.
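The skew is easier to see if GNI is put on a log scale; a quick check (not central to the analysis) would be:
# histogram of log-transformed GNI
hist(log(human$GNI))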
# Access GGally and corrplot
library(GGally)
library(corrplot)
library(magrittr)
# visualize the 'human' variables
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
cor(human)%>%corrplot()
From the ggpairs plot it is easier to see the distributions of the variables. The correlation plot helps to see the correlations between the variables.
It seems that the higher the maternal mortality rate, MMR (or adolescent birth rate, ABR), the lower the life expectancy, LEB (or expected years of education, EYE). MMR and ABR, as well as LEB and EYE, are also correlated with each other.
I perform principal component analysis on the unscaled human dataset.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
From the summary table I can see that unscaled data like this is not suitable for the analysis: the first component supposedly explains all the variation in the data set and the remaining dimensions add nothing extra.
From the biplot it looks as if the variable GNI, gross national income, is the factor explaining everything. However, this is not the case in reality. The result appears only because GNI had by far the largest variation among the variables and therefore only “seems” to explain most of the variation. I should have standardized the data to make the variation of all the variables comparable.
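To see why, one can compare the raw variances of the variables; GNI’s variance dwarfs all the others (a quick check):
# raw variances of the unscaled variables
sapply(human, var)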
Usually variables have to be standardized for a PCA so I’ll do that and then rerun the analysis.
human_std <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Now the summary table shows more reasonable amounts of variation explained by each successive principal component. The first component explains 53.6 %.
The biplot also makes more sense now. The first dimension doesn’t explain all the variation in the data and the second component supplements it. The variables are spread further apart in this biplot and no single variable dominates. Gross national income is no longer the only factor explaining differences between the observations; it is just one factor among others. GNI actually groups together with expected years of education, life expectancy at birth and the ratio of females to males with at least secondary education. These variables tell of the same thing and are correlated. In this biplot the variations of the variables are more equal and therefore each variable explains its own bit of the observations.
The second biplot shows that the variables GNI, LEB, RAT.EDU and EYE are positively correlated with principal component 1, whereas ABR and MMR are negatively correlated with it. The variables RAT.LAB and PARL are negatively correlated with principal component 2.
In other words, the observations are mainly divided so that countries with a high gross national income also have high life expectancy at birth, a high ratio of females to males with at least secondary education and high expected years of education. Conversely, countries with high maternal mortality also have high adolescent birth rates. These two extremes are opposite to each other, and in the biplot their arrows point at close to a 180-degree angle, indicating a strong negative correlation. For example, countries with a high gross national income have low rates of maternal mortality.
RAT.LAB, the ratio of females to males in the labour force, and PARL, the percentage of females in parliament, are the variables that explain the second largest part of the variation in the data. These two variables are somewhat correlated: countries with a high ratio of females at work also have a high rate of females in parliament. They are also orthogonal to the other variables, meaning that they are not correlated with them. For example, a country with a high GNI can have a high or a low rate of women in parliament.
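The same story can be read from the loadings themselves. Rounding the first two columns of the rotation matrix gives a numeric counterpart to the biplot (a quick check):
# loadings of the variables on PC1 and PC2
round(pca_human$rotation[, 1:2], 2)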
If the variables are factors instead of numerical, we have to do a multiple correspondence analysis.
library(dplyr)
library(tidyr)
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.4
data(tea)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
The tea dataset consists of 300 rows and 36 columns. Rows represent the individuals, columns represent the different questions. The first 18 questions are active ones, the 19th is a supplementary quantitative variable (the age) and the last variables are supplementary categorical variables. I didn’t find exact explanations for the variables but I chose 6 of them which are quite self-explanatory if you look at the variable names and observation values. All the variables are categorical.
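A quick sanity check of those dimensions (only the 6 selected columns are used in the analysis below):
dim(tea)
dim(tea_time)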
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
The variables have either roughly equal (sugar) or unequal (lunch) numbers of observations on each factor level.
I could guess that people who drink green tea drink it without sugar, alone and buy it unpackaged from a tea shop.
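Before running the MCA, that guess can be eyeballed with simple cross-tabulations (a quick check, output not shown):
# cross-tabulations of tea type against packaging and sugar
table(tea_time$Tea, tea_time$how)
table(tea_time$Tea, tea_time$sugar)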
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"),habillage = "quali")
From the summary and from the biplot it seems that the dimensions are not able to explain much of the variation. The first dimension explains “only” 15 % of the variation.
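A scree plot of the MCA eigenvalues shows the same thing at a glance (a small addition; the same percentages are listed in the summary above):
# percentage of variance per MCA dimension
barplot(mca$eig[, 2], names.arg = rownames(mca$eig), las = 2, ylab = "% of variance")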
There are some factor levels that are similar (close to each other in the biplot). For example, people who drink ‘unpackaged’ tea buy their tea from ‘tea shops’. Makes sense. Conversely, people who consume ‘tea bags’ buy their tea from ‘chain stores’. My assumption about people drinking unpackaged green tea bought from tea shops is not far-fetched, because ‘green’ is also rather similar to (close to) ‘unpackaged’ and ‘tea shop’. ‘No sugar’ and ‘alone’ are farther off, but still closer than ‘sugar’, ‘lemon’, ‘milk’ or ‘other’.
This week I learned how to handle data where the same individuals have been measured repeatedly over a time frame.
First I learned how to look for differences between groups by visually exploring the data.
#read in data
BPRSL<-read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/men.csv",header = T)
#structure and summary of data
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
summary(BPRSL)
## treatment subject weeks bprs week
## Min. :1.0 Min. : 1.00 week0 : 40 Min. :18.00 Min. :0
## 1st Qu.:1.0 1st Qu.: 5.75 week1 : 40 1st Qu.:27.00 1st Qu.:2
## Median :1.5 Median :10.50 week2 : 40 Median :35.00 Median :4
## Mean :1.5 Mean :10.50 week3 : 40 Mean :37.66 Mean :4
## 3rd Qu.:2.0 3rd Qu.:15.25 week4 : 40 3rd Qu.:43.00 3rd Qu.:6
## Max. :2.0 Max. :20.00 week5 : 40 Max. :95.00 Max. :8
## (Other):120
In this BPRS data set 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
I notice that R read treatment and subject in as integers even though they were factors when I saved the file. I convert them back to factors.
# Factor treatment & subject in BPRS
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
Graphical displays of data are almost always useful for exposing patterns in the data. To begin we shall plot the BPRS values for all 40 men, differentiating between the treatment groups into which the men have been randomized. This simple graph makes a number of features of the data readily apparent.
#Access the package ggplot2
library(ggplot2)
# Draw the plot
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
From these plots I can see that in both treatments bprs values are generally decreasing as time goes by. There are some individuals who have high values at the beginning of the testing and at the end of the testing, whereas for some individuals the values are notably lower all the time.
Standardizing the bprs values makes it easier to compare the changes happening between individuals.
Notice that here I use the function ‘scale’ for standardization, whereas in the DataCamp exercise the scaling was done manually.
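For comparison, a manual version of the same standardization could be written in base R roughly like this (a sketch only, not run; stdbprs_manual is just an illustrative name):
# standardize bprs within each week by hand
BPRSL$stdbprs_manual <- ave(BPRSL$bprs, BPRSL$week, FUN = function(x) (x - mean(x)) / sd(x))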
library(magrittr)
library(dplyr)
library(tidyr)
# Standardise the variable bprs
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = scale(bprs)) %>%
ungroup()
#check the results
glimpse(BPRSL)
## Observations: 360
## Variables: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, wee...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ stdbprs <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.69836...
# Plot again with the standardised bprs
ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(name = "standardized bprs")
Standardizing the values made it easier to compare the differences between the individuals, but in order to compare the two treatment groups I would like to summarize the results of each group somehow and compare those summaries. However, I don’t want to lose information on the variability of the data, which was clearly visible in the plots above.
To compare the two treatments I summarize the bprs values of the two groups by computing their weekly mean values and standard errors, and plot these by group.
# Number of weeks, baseline (week 0) included
n <- BPRSL$week %>% unique() %>% length()
# Summary data with mean and standard error of bprs by treatment and week
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise( mean = mean(bprs), se = sd(bprs)/sqrt(n) ) %>%
ungroup()
# Glimpse the data
glimpse(BPRSS)
## Observations: 18
## Variables: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ week <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8
## $ mean <dbl> 47.00, 46.80, 43.55, 40.90, 36.60, 32.70, 29.70, 29....
## $ se <dbl> 4.534468, 5.173708, 4.003617, 3.744626, 3.259534, 2....
# Plot the mean profiles
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
geom_line() +
scale_linetype_manual(values = c(1,2)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(bprs) +/- se(bprs)")
The weekly mean values of the two groups seem to differ somewhat, but looking at the variability (the standard error bars) it is obvious that the values of the two groups overlap a lot. This suggests that the differences between the two groups might not be significant.
Summarizing the information in the dataset even further by taking the mean of the bprs values over all the weeks lets us compare the treatments as a whole. This is an example of the summary measure approach. The mean of weeks 1 to 8 will be my summary measure.
# Create a summary data by treatment and subject with mean as the summary variable (ignoring baseline week 0).
BPRSL8S <- BPRSL %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
# Glimpse the data
glimpse(BPRSL8S)
## Observations: 40
## Variables: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ mean <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.1...
# Draw a boxplot of the mean versus treatment
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
The mean summary measure is more variable in the second treatment group and its distribution in this group is somewhat skewed. The boxplot of the second group also reveals an outlier, a subject whose mean BPRS score over the eight weeks is above 70. It might bias the conclusions from further comparisons of the groups, so I remove that subject from the data.
# Create a new data set by filtering out the outlier and draw the plot again with the new data
BPRSL8S1 <- BPRSL8S %>%
filter(mean < 70)
glimpse(BPRSL8S1)
## Observations: 39
## Variables: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ mean <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.1...
ggplot(BPRSL8S1, aes(x = treatment, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
Removing the outlier considerably lowered the mean value of the treatment 2 group. Normally I wouldn’t remove outliers unless there was a clear reason for doing so, for example if this value (BPRS > 70) was clearly an error (a typing error in the notes) or came from a person who was somehow not comparable with the other participants (had suffered some kind of trauma or had an abnormal medical condition).
Even though the visual exploration clearly suggests that there are no differences between the two treatment groups, it is advisable to test this statistically. A simple test is the t-test, which compares the mean values of the two groups.
# Perform a two-sample t-test
t.test(mean ~ treatment, data = BPRSL8S1, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by treatment
## t = 0.52095, df = 37, p-value = 0.6055
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.232480 7.162085
## sample estimates:
## mean in group 1 mean in group 2
## 36.16875 34.70395
The t-test confirms the lack of any evidence for a group difference. The 95% confidence interval is also wide and includes zero, which supports the same conclusion.
The data includes information on the bprs values of each individual before any treatment was applied. This value can be called the baseline. For some individuals it was higher than for others and it also stays high throughout the treatment period. This value is correlated with the mean BPRS values and therefore it might be a good idea to take it into account when testing for differences between the groups. Adding information about the baseline in the analysis can often lead to substantial gains in precision when used appropriately as a covariate in an analysis of covariance.
I fit a linear model with mean BPRS as the response variable and treatment and baseline as explaining factors.
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
# Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
mutate(baseline = BPRS$week0)
# Fit the linear model with the mean as the response
fit <- lm(mean~baseline+treatment, data = BPRSL8S2)
# Compute the analysis of variance table for the fitted model with anova()
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + treatment, data = BPRSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1729 -4.5994 0.1088 4.6703 21.0656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.07897 4.55710 2.870 0.00675 **
## baseline 0.49127 0.08943 5.493 3.05e-06 ***
## treatment2 -0.58879 2.49584 -0.236 0.81480
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.872 on 37 degrees of freedom
## Multiple R-squared: 0.4494, Adjusted R-squared: 0.4196
## F-statistic: 15.1 on 2 and 37 DF, p-value: 1.605e-05
From the summary table I can see that the baseline BPRS is strongly related (p-value < 0.001) to the BPRS values taken after treatment has begun, but there is still no evidence of a treatment difference (p-value=0.81) even after conditioning on the baseline value.
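As an aside, the comment in that chunk mentions anova() but I only printed summary(fit); for completeness, the analysis of variance table for the same model would be obtained like this (output not shown):
# analysis of variance table for the fitted covariance model
anova(fit)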
Longitudinal data, where a response variable is measured on each subject on several different occasions, poses problems for analysis because the repeated measurements on each subject are very likely to be correlated rather than independent. Next I learned about methods for dealing with longitudinal data that aim to account for this correlated structure and in which the response is assumed to be normally distributed.
To investigate the use of linear mixed effects models I use data from a nutrition study conducted in three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period.
#read in data
RATSL<-read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/rats.csv",header = T)
# Factor ID and Group
RATSL$ID <- factor(RATSL$ID)
RATSL$Group<- factor(RATSL$Group)
#structure and summary of data
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
summary(RATSL)
## ID Group WD Weight Time
## 1 : 11 1:88 WD1 :16 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 WD15 :16 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 WD22 :16 Median :344.5 Median :36.00
## 4 : 11 WD29 :16 Mean :384.5 Mean :33.55
## 5 : 11 WD36 :16 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 WD43 :16 Max. :628.0 Max. :64.00
## (Other):110 (Other):80
Plotting the values makes it easier to see differences between individuals and groups.
# Plot the RATSL data
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype=Group))+scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10))+scale_y_continuous(name = "Weight (grams)")+theme(legend.position = "top")
From the plot I can see that there are substantial differences between the groups of rats. Already at the beginning of the test the rats represented very different weight classes. In all the groups the rats gained weight as time went by, perhaps more so in groups 2 and 3.
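A quick numeric check of the same group differences could be made like this (a small addition, output not shown):
# mean weight per group at the first and last measurement occasion
RATSL %>% group_by(Group) %>% summarise(start_weight = mean(Weight[Time == min(Time)]), end_weight = mean(Weight[Time == max(Time)]))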
If we were to ignore the fact that the measurements came from the same individuals, we could treat the values as independent and use a simple linear model to analyse the data. This is wrong, but I do it for didactic reasons.
I fit a linear model with weight as the response variable and time and group as explaining variables:
# create a regression model RATS_reg
RATS_reg <- lm(Weight~Time+Group,data=RATSL)
# print out a summary of the model
summary(RATS_reg)
##
## Call:
## lm(formula = Weight ~ Time + Group, data = RATSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.643 -24.017 0.697 10.837 125.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 244.0689 5.7725 42.281 < 2e-16 ***
## Time 0.5857 0.1331 4.402 1.88e-05 ***
## Group2 220.9886 6.3402 34.855 < 2e-16 ***
## Group3 262.0795 6.3402 41.336 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.34 on 172 degrees of freedom
## Multiple R-squared: 0.9283, Adjusted R-squared: 0.9271
## F-statistic: 742.6 on 3 and 172 DF, p-value: < 2.2e-16
From the model’s summary table I can see that both explanatory variables are statistically significant and that group 1 differs substantially from groups 2 and 3.
I don’t want to ignore the fact that measurements were taken from the same rats. I fit a random intercept model that allows the linear regression fit for each rat to differ in intercept from other rats. I include the variable ID as a random component in order to distinguish between individuals.
# access library lme4
library(lme4)
## Warning: package 'lme4' was built under R version 3.4.4
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
# Create a random intercept model
RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATSL, REML = FALSE)
# Print the summary of the model
summary(RATS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (1 | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1333.2 1352.2 -660.6 1321.2 170
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5386 -0.5581 -0.0494 0.5693 3.0990
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 1085.92 32.953
## Residual 66.44 8.151
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 244.06890 11.73107 20.80
## Time 0.58568 0.03158 18.54
## Group2 220.98864 20.23577 10.92
## Group3 262.07955 20.23577 12.95
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.090
## Group2 -0.575 0.000
## Group3 -0.575 0.000 0.333
From the summary output I can interpret that the weights of rats in group 1 clearly differ from those in groups 2 and 3 (the group effects have large t values), while the random intercept accounts for the differences between individual rats.
Next I fit a random intercept and random slope model that allows the linear regression fits for each individual to differ in intercept but also in slope. This way it is possible to account for the individual differences in the rats’ growth profiles, but also the effect of time.
# create a random intercept and random slope model
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)
# print a summary of the model
summary(RATS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (Time | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1194.2 1219.6 -589.1 1178.2 168
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2261 -0.4322 0.0555 0.5638 2.8827
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 1140.5363 33.7718
## Time 0.1122 0.3349 -0.22
## Residual 19.7456 4.4436
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 246.45727 11.81526 20.859
## Time 0.58568 0.08548 6.852
## Group2 214.58736 20.17983 10.634
## Group3 258.92732 20.17983 12.831
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.166
## Group2 -0.569 0.000
## Group3 -0.569 0.000 0.333
Which one of the models is better? I perform an ANOVA test on the two models:
# perform an ANOVA test on the two models
anova(RATS_ref1, RATS_ref)
## Data: RATSL
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## RATS_ref 6 1333.2 1352.2 -660.58 1321.2
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2 142.94 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a significant difference between the two models, and it seems that RATS_ref1, i.e. the random intercept and random slope model, has the better fit. I am personally accustomed to comparing the AIC values of the models.
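For reference, those AIC values can be compared side by side like this:
# side-by-side AIC comparison of the two models
AIC(RATS_ref, RATS_ref1)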
Next I want to try a model that also includes an interaction term between time and group and compare this model with the previous model without the interaction:
# create a random intercept and random slope model with the interaction
RATS_ref2 <- lmer(Weight ~ Time + Group + Time*Group + (Time | ID), data = RATSL, REML = FALSE)
# print a summary of the model
summary(RATS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + Time * Group + (Time | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1185.9 1217.6 -582.9 1165.9 166
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2669 -0.4249 0.0726 0.6034 2.7513
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 1.107e+03 33.2763
## Time 4.925e-02 0.2219 -0.15
## Residual 1.975e+01 4.4436
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.65165 11.80279 21.321
## Time 0.35964 0.08215 4.378
## Group2 200.66549 20.44303 9.816
## Group3 252.07168 20.44303 12.330
## Time:Group2 0.60584 0.14229 4.258
## Time:Group3 0.29834 0.14229 2.097
##
## Correlation of Fixed Effects:
## (Intr) Time Group2 Group3 Tm:Gr2
## Time -0.160
## Group2 -0.577 0.092
## Group3 -0.577 0.092 0.333
## Time:Group2 0.092 -0.577 -0.160 -0.053
## Time:Group3 0.092 -0.577 -0.053 -0.160 0.333
Comparison with an ANOVA:
# perform an ANOVA test on the two models
anova(RATS_ref2, RATS_ref1)
## Data: RATSL
## Models:
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## RATS_ref2: Weight ~ Time + Group + Time * Group + (Time | ID)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2
## RATS_ref2 10 1185.9 1217.6 -582.93 1165.9 12.361 2 0.00207 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It looks like the model including the interaction term has a better fit.
Next I draw two plots. One showing the original observations and one showing fitted values from the model.
# draw the plot of RATSL with the observed Weight values
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Observed weight (grams)") +
theme(legend.position = "top")
# Create a vector of the fitted values
Fitted <- fitted(RATS_ref2)
# Create a new column fitted to RATSL
RATSL<-mutate(RATSL,Fitted)
# draw the plot of RATSL with the Fitted values of weight
ggplot(RATSL, aes(x = Time, y = Fitted, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Fitted weight (grams)") +
theme(legend.position = "top")
Visual inspection shows that the model I created (with a random intercept and slope as well as the interaction term) seems to fit very well. The growth rates of the rats are explained by the group a rat belongs to, by the time that has passed, and by the interaction of these two.